Introduction

There is a large, diverse literature on creating indexes of what amounts to “Socioeconomic status” in spatial areas like neighborhoods. These indexes can then be used to investigate correlations between socioeconomics and health outcomes, infrastructure access, environmental exposures, or other variables, or target certain neighborhoods that seem particularly vulnerable for interventions. Many variables can be used to contribute to such an index. However, it is always difficult to choose which variables and how to weight them. For example, the USEPA EJSCREEN project tracks 6 demographic variables, but only uses two when constructing demographic indices to weight neighborhoods for an equity- weighting of environmental exposures: a simple average of % of population low-ioncome and % population “minority” (nonwhite alone/nonhispanic). One method that has been popular in social science and public health policy has been Principal Components Analysis (PCA). See this study for a foundational example. PCA creates linear combinations of sets of variables that explain the maximum amount of overall variance in the complete dataset. This amounts to weighting the variables that have the widest spread and that covary with other variables. In this way, any arbitrary combination of variables of interest to policymakers/planners/analysts can be chosen, and a linear combination of them constructed for use as a single or multiple indices in an automated manner. This approach has the added value of being customizable for different communities, where different axes of inequity may be more or less important or impactful than in other communities.

Data sources

In the United States, typically this activity has been done to identify regions with socioeconomic correlations with cancer incidence. The typical source for variables has been the U.S Census American Community Survey, which includes some detailed demographic information about race/ethnicity, education levels, income, occupation, and housing quality, among other topics.

Two or more of the following variables are often used:

  • % of households living below 200% of the Federal Poverty Level
  • % of population of age >=25 with <= grade 12 education
  • % of households nonwhite and nonhispanic (i.e. “% minority”)
  • Median household income
  • per capita income
  • % population on public assistance (SSI Disability, TANF, etc)
  • median monthly rent
  • median value of owner-occupied housing
  • % population in owner-occupied housing
  • % population with “blue collar jobs”
  • Over 16 unemployment rate
  • % Households with more than 1.5 person per room
  • % Households without complete plumbing
  • percent HH “linguistically isolated (no people speak english”very well" or better)
  • % HH with no vehicle available.
  • % single-parent families

Some studies using them are listed below:

The workflow

First, we take an example service area boundary. In this case the Newark Water Department.

#pws <- read_sf("C:/Users/kyleo/Downloads/pws.gpkg")
#newark <- filter(pws,PWSID=="NJ0714001")
newark <- read_sf("https://info.geoconnex.us/collections/pws/items/NJ0714001")
mapview(newark)

Then we find the relevant Census Block Groups (orange) or Tracts (green), first downloading the entire state and then subsetting to the utility service area (blue).

bg <- get_acs(state = "NJ", geography = "block group", variables = "B19013_001", geometry = TRUE) %>% st_transform(4326)
## Getting data from the 2014-2018 5-year ACS
## Fetching block group data by county and combining the result.
tr <- get_acs(state = "NJ", geography = "tract", variables = "B19013_001", geometry = TRUE) %>% st_transform(4326)
## Getting data from the 2014-2018 5-year ACS
bg1 <- st_intersects(bg,newark) %>% as.integer()
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
bg <- bg[which(bg1==1),]

tr1 <- st_intersects(tr,newark) %>% as.integer()
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
tr <- tr[which(tr1==1),]

mapview(newark, layer.name="Newark Water Department", map.types= "OpenStreetMap") +
  mapview(bg, alpha.regions=0, color="orange", col.regions="orange",lwd=2, layer.name="Block Groups", map.types= "OpenStreetMap") +
  mapview(tr, alpha.regions=0, color="green",col.regions="green", lwd=3, layer.name="Tracts", map.types= "OpenStreetMap") 

Now we download the necessary census data and match them to the appropriate census boundaries.

vars <- tidycensus::load_variables(2019, "acs5", cache = TRUE)
cv <- c(pop_race_count = "B03002_001",
        pop_nonhispanic_white_alone = "B03002_003",
                 pop_educ_attainment_count = "B15003_001",
                 pop_educ_none = "B15003_002", #0
                 pop_educ_nursery = "B15003_003", #0
                 pop_educ_kindergarten = "B15003_004", #0
                 pop_educ_grade1 = "B15003_005", #1
                 pop_educ_grade2 = "B15003_006", #2
                 pop_educ_grade3 = "B15003_007", #3
                 pop_educ_grade4 = "B15003_008", #4
                 pop_educ_grade5 = "B15003_009", #5
                 pop_educ_grade6 = "B15003_010", #6
                 pop_educ_grade7 = "B15003_011", #7
                 pop_educ_grade8 = "B15003_012", #8
                 pop_educ_grade9 = "B15003_013", #9
                 pop_educ_grade10 = "B15003_014", #10
                 pop_educ_grade11 = "B15003_015", #11
                 pop_educ_grade12_nodiploma = "B15003_016", #11.5
                 pop_educ_HSdiploma = "B15003_017", #12
                 pop_educ_GED = "B15003_018", #12
                 pop_educ_college_less_1year = "B15003_019", #13
                 pop_educ_college_more_1year_nodegree = "B15003_020", #13.5
                 pop_educ_assoc = "B15003_021", #14
                 pop_educ_bachelor = "B15003_022", #16
                 pop_educ_master = "B15003_023", #18
                 pop_educ_prof = "B15003_024", #19
                 pop_educ_doc = "B15003_025", # 21
                 poverty_level_total = "C17002_001",
                 poverty_level_gr_200FPL = "C17002_008" ,
                 hh_income_20ptile = "B19080_001",
                 hh_income_median = "B19013_001",
                 pop_hh_income_assistance_total = "B09010_001",
                 pop_in_hh_with_income_assistance = "B09010_002",
                 per_capita_income = "B19301_001",
                 pop_labor_force = "B23025_001",
                 pop_unemployed = "B23025_005",
                 housing_units_count = "B25002_001",
                 housing_units_vacant = "B25002_003",
                 housing_units_occupied = "B25002_002",
                 pop_units_count = "B25033_001",
                 pop_owner_occupied_units = "B25033_002",
                 pop_rental_units = "B25033_008",
                 housing_units_plumbing_count = "B25050_001",
                 housing_units_plumbing_complete = "B25050_002",
                 rent_median = "B25058_001",
                 median_gross_rent_percent_hh_income = "B25071_001",
                 home_value_median = "B25077_001",
                 hh_owner_occupied = "B25014_002",
                 hh_owner_occupied_1.5_2_perRoom = "B25014_006",
                 hh_owner_occupied_gr2.0_perRoom = "B25014_007",
                 hh_renters = "B25014_008",
                 hh_renters_1.5_2_perRoom = "B25014_012",
                 hh_renters_gr2.0_perRoom = "B25014_013",
                 hh_lang_total = "C16002_001",
                 hh_lang_ltd_api ="C16002_010",
                 hh_lang_ltd_otherio = "C16002_007",
                 hh_lang_ltd_spanish = "C16002_004",
                 hh_lang_ltd_other = "C16002_013",
                 pop_age_total = "B01001_001",
                 pop_age_male_65_66 = "B01001_020",
                 pop_age_male_67_69 = "B01001_021",
                 pop_age_male_70_74 = "B01001_022",
                 pop_age_male_75_79 = "B01001_023",
                 pop_age_male_80_84 = "B01001_024",
                 pop_age_male_85up = "B01001_025",
                 pop_age_female_65_66 = "B01001_044",
                 pop_age_female_67_69 = "B01001_045",
                 pop_age_female_70_74 = "B01001_046",
                 pop_age_female_75_79 = "B01001_047",
                 pop_age_female_80_84 = "B01001_048",
                 pop_age_female_85up = "B01001_049",
                 occupation_total = "C24010_001",
                 occupation_sales_office_male = "C24010_027",
                 occupation_sales_office_female = "C24010_063",
                 occupation_service_male = "C24010_019",
                 occupation_service_female = "C24010_055",
                 occupation_protective_service_male = "C24010_021",
                 occupation_protective_service_female = "C24010_057",
                 occupation_natresource_male = "C24010_030",
                 occupation_natresource_female = "C24010_066",
                 occupation_prodtrans_male = "C24010_034",
                 occupation_prodtrans_female = "C24010_070",
                hh_vehicle_count = "B25044_001",
                hh_owner_noVehicle = "B25044_003",
                hh_renter_noVehicle = "B25044_010",
                families_total = "B11003_001",
                families_singleparent_female = "B11003_016",
                families_singleparent_male = "B11003_010"
        
    )


counties <- tidycensus::get_acs(year=2019,
                                geography = "county",
                                variables = "B01001_001",
                                geometry = TRUE)  %>% 
  sf::st_transform(4326) %>%
  dplyr::filter(lengths(sf::st_intersects(., newark)) > 0)
## Getting data from the 2015-2019 5-year ACS
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Find FIPS code for relevant state and counties, make sure no duplicates
st <- unique(substr(counties$GEOID,1,2)) 
ct <- unique(substr(counties$GEOID,3,5))

data.tr <- get_acs(year=2019,
                              geography = "tract",
                              variables = cv,
                              geometry = TRUE,
                              state = st,
                              county=ct,
                              output="wide") %>%
  filter(GEOID %in% tr$GEOID)
## Getting data from the 2015-2019 5-year ACS
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
 data.bg <- get_acs(year=2019,
                              geography = "block group",
                              variables = cv,
                              geometry = TRUE,
                              state = st,
                              county=ct,
                              output="wide") %>%
  filter(GEOID %in% bg$GEOID)
## Getting data from the 2015-2019 5-year ACS
## Fetching block group data by county and combining the result.

Now we construct the variables for our PCA at the Tract and Block Group level.

pcaData.bg <- data.bg %>% 
  mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
         percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
                        pop_educ_GEDE +
                        pop_educ_college_less_1yearE +
                        pop_educ_college_more_1year_nodegreeE + 
                        pop_educ_assocE +
                        pop_educ_masterE +
                        pop_educ_profE +
                        pop_educ_docE)/pop_educ_attainment_countE),
         percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
         income_median = hh_income_medianE,
         income_20tile = hh_income_20ptileE,
         income_percapita = per_capita_incomeE,
         percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
         rentMedian = rent_medianE,
         rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
         homeValuemedian = home_value_medianE,
         percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
         percBlueCollar = 100*(occupation_sales_office_maleE +
                                 occupation_sales_office_femaleE +
                                 occupation_service_maleE +
                                 occupation_service_femaleE -
                                 occupation_protective_service_maleE -
                                 occupation_protective_service_femaleE +
                                 occupation_natresource_maleE +
                                 occupation_natresource_femaleE +
                                 occupation_prodtrans_maleE +
                                 occupation_prodtrans_femaleE)/occupation_totalE,
         percUnemployed=pop_unemployedE/pop_labor_forceE,
         percPlumbingComplete = housing_units_plumbing_completeE/housing_units_plumbing_countE,
         percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
         percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
         percAgeOver64 = 100*(pop_age_male_65_66E + 
                                pop_age_male_67_69E +
                                pop_age_male_70_74E +
                                pop_age_male_75_79E +
                                pop_age_male_80_84E +
                                pop_age_male_85upE +
                                pop_age_female_65_66E + 
                                pop_age_female_67_69E +
                                pop_age_female_70_74E +
                                pop_age_female_75_79E +
                                pop_age_female_80_84E +
                                pop_age_female_85upE)/pop_age_totalE,
         percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
                                        hh_lang_ltd_otherioE +
                                        hh_lang_ltd_spanishE +
                                        hh_lang_ltd_otherE)/hh_lang_totalE,
         percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
                                   hh_owner_occupied_gr2.0_perRoomE +
                                   hh_renters_1.5_2_perRoomE +
                                   hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
  select(GEOID,percBelow200FPL:percHH_Crowded) %>%
  st_drop_geometry()

pcaData.tr <- data.tr %>% 
  mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
         percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
                        pop_educ_GEDE +
                        pop_educ_college_less_1yearE +
                        pop_educ_college_more_1year_nodegreeE + 
                        pop_educ_assocE +
                        pop_educ_masterE +
                        pop_educ_profE +
                        pop_educ_docE)/pop_educ_attainment_countE),
         percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
         income_median = hh_income_medianE,
         income_20tile = hh_income_20ptileE,
         income_percapita = per_capita_incomeE,
         percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
         rentMedian = rent_medianE,
         rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
         homeValuemedian = home_value_medianE,
         percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
         percBlueCollar = 100*(occupation_sales_office_maleE +
                                 occupation_sales_office_femaleE +
                                 occupation_service_maleE +
                                 occupation_service_femaleE -
                                 occupation_protective_service_maleE -
                                 occupation_protective_service_femaleE +
                                 occupation_natresource_maleE +
                                 occupation_natresource_femaleE +
                                 occupation_prodtrans_maleE +
                                 occupation_prodtrans_femaleE)/occupation_totalE,
         percUnemployed=pop_unemployedE/pop_labor_forceE,
         percPlumbingComplete = housing_units_plumbing_completeE/housing_units_plumbing_countE,
         percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
         percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
         percAgeOver64 = 100*(pop_age_male_65_66E + 
                                pop_age_male_67_69E +
                                pop_age_male_70_74E +
                                pop_age_male_75_79E +
                                pop_age_male_80_84E +
                                pop_age_male_85upE +
                                pop_age_female_65_66E + 
                                pop_age_female_67_69E +
                                pop_age_female_70_74E +
                                pop_age_female_75_79E +
                                pop_age_female_80_84E +
                                pop_age_female_85upE)/pop_age_totalE,
         percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
                                        hh_lang_ltd_otherioE +
                                        hh_lang_ltd_spanishE +
                                        hh_lang_ltd_otherE)/hh_lang_totalE,
         percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
                                   hh_owner_occupied_gr2.0_perRoomE +
                                   hh_renters_1.5_2_perRoomE +
                                   hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
  select(GEOID,percBelow200FPL:percHH_Crowded) %>%
  st_drop_geometry()

Now, we inspect the distributions of the variables to see which ones may be problematic. It appears that the following variables are highly skewed and should be transformed:

  • % Minority
  • % houses with complete plumbing
  • % Households in crowded positions (>= 1.5 people per room)

It also appears that the following variables are not available or otherwise highly missing at the block group level:

  • % population on public assistance (100% missing)
  • THe 20th percentile of HH income (100% missing)
  • Median home value (11% missing)
  • Median income (7% missing)

If clients are determined to use these variables, tract-level analysis may be necessary, which may not work for custom geographies that need to be assembled from smaller geographies than tracts.

skim(pcaData.tr)
Data summary
Name pcaData.tr
Number of rows 130
Number of columns 20
_______________________
Column type frequency:
character 1
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GEOID 0 1 11 11 0 130 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
percBelow200FPL 2 0.98 47.47 15.23 4.80 38.74 48.73 57.90 81.23 ▁▂▇▇▂
percEducLessHS 0 1.00 33.75 8.25 14.43 27.83 32.10 38.86 56.73 ▁▇▇▃▁
percMinority 0 1.00 86.68 16.22 32.45 77.87 94.78 98.77 100.00 ▁▁▂▂▇
income_median 2 0.98 45771.48 23216.00 15333.00 31557.50 41725.50 51843.50 181088.00 ▇▃▁▁▁
income_20tile 2 0.98 18936.60 10874.67 4104.00 11358.50 15665.50 24163.00 78286.00 ▇▅▁▁▁
income_percapita 1 0.99 23205.33 8746.13 6074.00 17037.00 22203.00 26696.00 67211.00 ▆▇▂▁▁
percAssistance 2 0.98 39.43 19.57 0.00 25.49 40.07 51.06 91.27 ▃▆▇▂▁
rentMedian 3 0.98 1001.54 242.39 228.00 906.00 986.00 1130.00 1937.00 ▁▃▇▁▁
rentIncomeRatioMedian 2 0.98 35.44 7.22 20.80 29.98 34.25 39.50 51.00 ▂▇▇▅▃
homeValuemedian 9 0.93 255654.54 77196.06 9999.00 208500.00 245100.00 304900.00 588000.00 ▁▇▇▂▁
percHomeowners 2 0.98 31.93 16.96 0.00 20.84 29.20 40.24 91.87 ▃▇▃▂▁
percBlueCollar 2 0.98 70.11 10.82 30.91 65.77 71.39 77.43 87.97 ▁▁▂▇▅
percUnemployed 0 1.00 0.07 0.04 0.00 0.04 0.06 0.09 0.20 ▅▇▅▂▁
percPlumbingComplete 2 0.98 0.99 0.01 0.95 0.99 1.00 1.00 1.00 ▁▁▁▁▇
percFamiliesSingleParent 2 0.98 27.78 13.58 3.22 17.46 26.67 37.94 56.80 ▆▇▇▆▃
percHH_NoVehicle 2 0.98 32.41 14.84 2.29 21.46 32.51 42.67 70.86 ▃▆▇▅▁
percAgeOver64 0 1.00 11.14 4.47 0.43 8.06 10.38 14.33 24.90 ▂▇▆▃▁
percHH_LangIsolated 2 0.98 14.35 12.86 0.00 4.73 8.90 23.47 52.43 ▇▂▂▁▁
percHH_Crowded 2 0.98 2.61 2.38 0.00 0.64 2.16 4.06 9.29 ▇▅▂▂▁
skim(pcaData.bg)
Data summary
Name pcaData.bg
Number of rows 274
Number of columns 20
_______________________
Column type frequency:
character 1
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
GEOID 0 1 12 12 0 274 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
percBelow200FPL 5 0.98 48.52 19.51 0.00 37.30 49.48 60.00 100.00 ▂▅▇▃▁
percEducLessHS 1 1.00 33.67 11.92 0.00 25.94 32.89 40.95 69.86 ▁▅▇▃▁
percMinority 1 1.00 88.77 15.88 31.66 85.51 96.22 99.84 100.00 ▁▁▁▁▇
income_median 30 0.89 45129.20 22911.84 9389.00 30984.50 41250.00 53962.75 147500.00 ▇▇▂▁▁
income_20tile 274 0.00 NaN NA NA NA NA NA NA
income_percapita 4 0.99 22471.93 9377.28 2087.00 16424.75 20430.50 26568.50 60851.00 ▂▇▃▁▁
percAssistance 274 0.00 NaN NA NA NA NA NA NA
rentMedian 20 0.93 987.73 282.91 209.00 889.00 978.00 1124.00 2264.00 ▂▇▆▁▁
rentIncomeRatioMedian 12 0.96 37.21 9.04 18.30 29.90 35.90 44.25 51.00 ▂▆▇▅▇
homeValuemedian 65 0.76 260973.67 93729.69 9999.00 197200.00 243800.00 309800.00 818200.00 ▂▇▂▁▁
percHomeowners 5 0.98 31.23 21.22 0.00 16.09 26.21 41.61 100.00 ▇▇▃▂▁
percBlueCollar 6 0.98 70.36 14.82 20.45 63.13 72.91 80.60 100.00 ▁▂▅▇▃
percUnemployed 1 1.00 0.08 0.06 0.00 0.04 0.07 0.10 0.60 ▇▂▁▁▁
percPlumbingComplete 5 0.98 0.99 0.02 0.84 1.00 1.00 1.00 1.00 ▁▁▁▁▇
percFamiliesSingleParent 5 0.98 28.44 19.14 0.00 14.59 26.53 40.48 100.00 ▇▇▅▁▁
percHH_NoVehicle 5 0.98 32.55 18.14 0.00 20.73 30.89 43.27 85.71 ▅▇▆▃▁
percAgeOver64 1 1.00 11.30 6.72 0.00 6.63 10.53 14.98 41.86 ▆▇▂▁▁
percHH_LangIsolated 5 0.98 14.27 15.46 0.00 1.13 9.30 22.72 81.03 ▇▃▁▁▁
percHH_Crowded 5 0.98 2.51 3.94 0.00 0.00 0.00 3.77 31.08 ▇▁▁▁▁

With this knowledge, we conduct the PCA, first on all the available variables for Block Groups and for Tracts. Below, we report on the first three principal components of the PCA for the tracts and block groups.

is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))

pcaData.bg[is.nan(pcaData.bg)]<-NA
x=select(pcaData.bg,-GEOID,-percAssistance, -income_20tile)
pca.bg <- prcomp(x=na.omit(x), scale=TRUE, center=TRUE, na.action=na.omit, rank=3)
y<-predict(pca.bg,pcaData.bg)
pcaData.bg$PC1_allVars <- y[,1]

pcaData.tr[is.nan(pcaData.tr)]<-NA
x=select(pcaData.tr,-GEOID)
pca.tr <- prcomp(x=na.omit(x), scale=TRUE, center=TRUE, na.action=na.omit, rank=3)
y<-predict(pca.tr,pcaData.tr)
pcaData.tr$PC1_allVars <- y[,1]
pcaData.tr$PC1 <- y[,1]
pcaData.tr$PC2 <- y[,2]
pcaData.tr$PC3 <- y[,2]

Results

For tracts, we see that the first component alone explains 42% of the variance in the specified variables. Higher values on this component are associated with lower income measures, higher poverty, public assistance and unemployment, lower rents and housing values, more blue collar employment, higher proportions of minority groups, lower homeownership rates, less vehicle ownership, and higher prevalence of single-parent families, without being particularly informed by linguistic isolation, education levels, or housing crowdedness.

kable(summary(pca.tr)$importance[1:3,1:3],digits=2)
PC1 PC2 PC3
Standard deviation 2.81 1.72 1.16
Proportion of Variance 0.42 0.16 0.07
Cumulative Proportion 0.42 0.57 0.64
kable(pca.tr$rotation,digits=2)
PC1 PC2 PC3
percBelow200FPL 0.33 -0.09 0.02
percEducLessHS -0.07 -0.46 0.18
percMinority 0.25 0.27 -0.13
income_median -0.33 0.04 -0.06
income_20tile -0.31 0.03 -0.05
income_percapita -0.32 0.06 0.05
percAssistance 0.26 0.02 0.13
rentMedian -0.24 -0.06 -0.41
rentIncomeRatioMedian 0.18 0.03 -0.32
homeValuemedian -0.24 -0.24 0.03
percHomeowners -0.27 0.21 -0.09
percBlueCollar 0.25 -0.21 -0.06
percUnemployed 0.13 0.31 -0.20
percPlumbingComplete 0.00 0.17 0.40
percFamiliesSingleParent 0.26 0.14 -0.11
percHH_NoVehicle 0.27 -0.17 0.24
percAgeOver64 -0.10 0.12 0.54
percHH_LangIsolated 0.01 -0.52 -0.03
percHH_Crowded 0.04 -0.30 -0.29

For block groups, we see that the first component alone explains only 29% of the variance in the specified variables (compared to 42% in the tract-level analysis). However, the way the variables contribute to the first component is similar to how they do in the tract-level analysis.

kable(summary(pca.bg)$importance[1:3,1:3],digits=2)
PC1 PC2 PC3
Standard deviation 2.21 1.70 1.18
Proportion of Variance 0.29 0.17 0.08
Cumulative Proportion 0.29 0.46 0.54
kable(pca.bg$rotation,digits=2)
PC1 PC2 PC3
percBelow200FPL -0.40 0.08 -0.06
percEducLessHS 0.07 0.42 0.03
percMinority -0.23 -0.35 -0.04
income_median 0.40 -0.05 -0.15
income_percapita 0.38 -0.05 -0.03
rentMedian 0.30 0.03 -0.29
rentIncomeRatioMedian -0.14 0.04 -0.26
homeValuemedian 0.19 0.34 0.09
percHomeowners 0.28 -0.21 -0.03
percBlueCollar -0.22 0.22 0.28
percUnemployed -0.17 -0.29 -0.03
percPlumbingComplete 0.01 -0.13 0.47
percFamiliesSingleParent -0.29 -0.12 -0.28
percHH_NoVehicle -0.28 0.21 0.11
percAgeOver64 0.12 -0.09 0.49
percHH_LangIsolated -0.01 0.51 0.04
percHH_Crowded -0.04 0.24 -0.42

Below, we visualize how the first component (PC1) can be used as a multidimensional socioeconomic status/ deprivation index, that incorporates more information and yields different results than just using income or poverty levels or minority group prevalence. Tracts have less missing data than block groups, generally due to U.S. Census Bureau privacy controls that restrict reporting of certain cross-tabulations in small areas. Thus, the smaller the area, the fewer variables are suitable for use in constructing socioeconomic indexes.

bg <- left_join(bg,pcaData.bg,by="GEOID")
tr <- left_join(tr,pcaData.tr,by="GEOID")
tr$PC1_allVars <- -tr$PC1_allVars

m.pc.bg <- mapview(bg,zcol="PC1_allVars", layer.name="Block Groups \n PC1") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
m.pov.bg <- mapview(bg,zcol="income_median", layer.name="Block Groups \n Median HH Income") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)

m.pc.tr <- mapview(tr,zcol="PC1_allVars", layer.name="Tracts \n PC1") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
m.pov.tr <- mapview(tr,zcol="income_median", layer.name="Tracts \n Median HH Income") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)

mapl <- sync(m.pc.bg, m.pov.bg, m.pc.tr, m.pov.tr)
mapl

Using indexes and correlation indexes to create insights.

Scaling variables by indexes.

Below we demonstrate how using an index can be used to highlight neighborhoods that users may be more interested in from an equity framework. For example, if the distribution of a problematic phenomenon such as taste complaints is randomly distributed, then from an equity perspective, one may wish to prioritize addressing these issues in the most socioeconomically burdened neighborhoods before less burdened neighborhoods.

We simulate tract-level taste complaints per household as a normally distributed random variable with a mean of 0.2 and a standard deviation of 0.1 and a minimum of 0, with no assumed correlation with any equity-related variable:

set.seed(2354367)
tr$taste_complaints_perHH <- rnorm(n=130,mean=0.2, sd=0.1)
tr$taste_complaints_perHH[which(tr$taste_complaints_perHH<0)] <- 0
tr$PC1_allVars <- -tr$PC1_allVars

map.tr <- mapview::mapview(tr,zcol="taste_complaints_perHH", layer.name="Tracts \n Taste Compl per HH.")

map.pc <- mapview::mapview(tr,zcol="PC1_allVars", layer.name="Tracts \n Burden Index")

sync(map.pc, map.tr)
ggplot2::ggplot(data=tr, aes(x=-PC1_allVars,y=taste_complaints_perHH)) + geom_point() + geom_abline() + xlab("Equity Principal Component (higher = more burdened)")
## Warning: Removed 9 rows containing missing values (geom_point).

As we can see above, there is no correlation between our simulated taste complaint indicator and the socioeconomic status index (scale reversed, so that higher numbers correspond to a more “burdened” area).

We can multiply the taste complaint indicator by the “burden” index to create a taste complaint scaled by burden that will highlight different areas to focus on than by focusing on the indicator alone:

tr$taste_scaled <- tr$PC1_allVars* tr$taste_complaints_perHH

map.tr <- mapview::mapview(tr,zcol="taste_complaints_perHH", layer.name="Tracts \n Taste Compl per HH.")

map.pc <- mapview::mapview(tr,zcol="taste_scaled", layer.name="Taste complaints scaled by burden")

sync(map.pc, map.tr)

Insights: most correlated variables

There are many sophisticated methods from the “machine learning” toolbox (e.g. stepwise regression algorithms, ridge regressions, LASSO, random forest models, ) that could serve to identify particular models based on sociodemographic/economic variables that most effectively predict values of equity/performance variables. However, it may be more straightforward to justify and communicate highlighting which sociodemographic variables are most linearly correlated with a variable of interest.

Let’s simulate come equity indicators that are various functions of the socioeconomic variables:

x <- tr[6:31] %>% 
  st_drop_geometry() %>% 
  na.aggregate()

pca <- pca.tr$x

y1 <- x$PC1 + rnorm(length(x$PC1),0,1)
y2 <- 0.5*x$PC1 + 0.5*x$PC2 + rnorm(length(x$PC1),0,1)
y3 <- 0.5*x$PC3 + 0.5*x$PC2 + rnorm(length(x$PC1),0,1)

Now we compute the correlation between each of these simulated indicators and all of the socioeconomic variables. Correlation is measured on a scale between -1 and 1, with values closer to 0 indicating less correlation. The table below shows how each of the three simulated equity indicators are correlated with each of the socioeconomic variables. For example, \(y1\) is highly correlated with the percentage of the households below 200% of the Federal Poverty level,Single parent families, and households with no vehicle, and strongly inversely correlated with income measures, home values, and home ownership rates. \(y3\), on the other hand, is most highly correlated with lower levels of linguistic isolation and education levels below high school. This would enable a dynamic insight to be generated, where a user could select an equity indicator, and be shown any relevant descriptions or plots for any associated socioeconomic variables.

y <- as.data.frame(cbind(y1,y2,y3))
x <- tr[6:24] %>% 
  st_drop_geometry() %>% 
  na.aggregate()
c <- cor(x=x,y=y)

test <- cor.mtest(cbind(y,x))$p[1:3,4:22]

corrplot(c,
  method = "number",
  type = "full",
  order="original"# show only upper side
)